Case Study Scenario:

You are a junior data analyst working on the marketing analyst team at Bellabeat, a high-tech manufacturer of health-focused products for women. Bellabeat is a successful small company, but they have the potential to become a larger player in the global smart device market. Urška Sršen, cofounder and Chief Creative Officer of Bellabeat, believes that analyzing smart device fitness data could help unlock new growth opportunities for the company. You have been asked to focus on one of Bellabeat’s products and analyze smart device data to gain insight into how consumers are using their smart devices. The insights you discover will then help guide marketing strategy for the company. You will present your analysis to the Bellabeat executive team along with your high-level recommendations for Bellabeat’s marketing strategy.

Ask

What to do in this section:

  • Define problem / objective
  • Ask effective questions

Define Problem / Objective

The objective of this analysis is to learn more about consumer behaviour in the smart device data (with a focus on the female audience) to help guide the marketing team.

Ask Questions (SKIPPED)

This section would normally contain the questions and answers from a project kickoff meeting. Since this is a case study, there is no Q&A session, so we will skip this step. If it did take place, the goal of this step would be to clarify the objective of the analysis project and learn more about its context.

Prepare

What to do in this section:

  • Understand how data is generated and collected
  • Identify data formats
  • Make sure the data is right
  • Organize and protect data
In [1]:
import os
import pandas as pd
In [2]:
# suppress warnings to keep the notebook output clean

import warnings

warnings.filterwarnings("ignore")
In [3]:
raw_data_path = "../raw_data/"

file_names = os.listdir(raw_data_path)
file_path = [f'{raw_data_path}{filename}' for filename in file_names]

print(file_names)
['minuteIntensitiesWide_merged.csv', 'dailyActivity_merged.csv', 'sleepDay_merged.csv', 'heartrate_seconds_merged.csv', 'minuteIntensitiesNarrow_merged.csv', 'minuteMETsNarrow_merged.csv', 'hourlySteps_merged.csv', 'hourlyIntensities_merged.csv', 'dailyCalories_merged.csv', 'hourlyCalories_merged.csv', 'minuteSleep_merged.csv', 'minuteCaloriesNarrow_merged.csv', 'dailySteps_merged.csv', 'minuteCaloriesWide_merged.csv', 'minuteStepsNarrow_merged.csv', 'minuteStepsWide_merged.csv', 'weightLogInfo_merged.csv', 'dailyIntensities_merged.csv']
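As a side note, os.listdir does not guarantee any particular order, so indexing into file_path (as the next cell does) is fragile across machines. A sketch of a sturdier alternative, keying each path by its file name instead of its list position (build_path_map is a hypothetical helper, not part of the notebook):

```python
import os

# Sketch: os.listdir order is OS-dependent, so keying the paths by file name
# (rather than by list index) makes the reads below less fragile.
def build_path_map(folder, file_names):
    # key each CSV by its stem, e.g. 'dailyActivity_merged'
    return {os.path.splitext(name)[0]: os.path.join(folder, name)
            for name in file_names if name.endswith('.csv')}

paths = build_path_map('../raw_data/', ['dailyActivity_merged.csv',
                                        'sleepDay_merged.csv'])
print(paths['dailyActivity_merged'])  # ../raw_data/dailyActivity_merged.csv
```

Each table could then be read as pd.read_csv(paths['dailyActivity_merged']) without depending on listdir order.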
In [4]:
# read all data

# Intensities
df_minuteIntensitiesWide = pd.read_csv(file_path[0])
df_minuteIntensitiesNarrow = pd.read_csv(file_path[4])
df_hourlyIntensities = pd.read_csv(file_path[7])
df_dailyIntensities = pd.read_csv(file_path[17])

# Calories
df_minuteCaloriesWide = pd.read_csv(file_path[13])
df_minuteCaloriesNarrow = pd.read_csv(file_path[11])
df_hourlyCalories = pd.read_csv(file_path[9])
df_dailyCalories = pd.read_csv(file_path[8])

# Step
df_minuteStepsWide = pd.read_csv(file_path[15])
df_minuteStepsNarrow = pd.read_csv(file_path[14])
df_hourlySteps = pd.read_csv(file_path[6])
df_dailySteps = pd.read_csv(file_path[12])

# Sleep
df_sleepDay = pd.read_csv(file_path[2])
df_minuteSleep = pd.read_csv(file_path[10])

# HeartRate
df_heartrate_seconds = pd.read_csv(file_path[3])

# Standalone
df_minuteMETsNarrow = pd.read_csv(file_path[5])
df_weightLogInfo = pd.read_csv(file_path[16])
df_dailyActivity = pd.read_csv(file_path[1])
In [5]:
df_dict = {
    'df_minuteIntensitiesWide' : df_minuteIntensitiesWide, 
    'df_minuteIntensitiesNarrow' : df_minuteIntensitiesNarrow, 
    'df_hourlyIntensities' : df_hourlyIntensities, 
    'df_dailyIntensities' : df_dailyIntensities,
    'df_minuteCaloriesWide' : df_minuteCaloriesWide, 
    'df_minuteCaloriesNarrow' : df_minuteCaloriesNarrow, 
    'df_hourlyCalories' : df_hourlyCalories, 
    'df_dailyCalories' : df_dailyCalories, 
    'df_minuteStepsWide' : df_minuteStepsWide, 
    'df_minuteStepsNarrow' : df_minuteStepsNarrow, 
    'df_hourlySteps' : df_hourlySteps, 
    'df_dailySteps' : df_dailySteps, 
    'df_sleepDay' : df_sleepDay, 
    'df_minuteSleep' : df_minuteSleep, 
    'df_heartrate_seconds' : df_heartrate_seconds,
    'df_minuteMETsNarrow' : df_minuteMETsNarrow, 
    'df_weightLogInfo' : df_weightLogInfo, 
    'df_dailyActivity' : df_dailyActivity
}
In [6]:
# helper to list all the variables that hold table data

def getTableNames():
    for df_name in df_dict:
        print(df_name)
In [7]:
def CheckDataFormat(df, df_name, is_wide=False):
    # Check whether the table is wide or not: wide tables have many columns
    # and fill most of the screen, so we display them separately.
    if (('Wide' in df_name) == is_wide):
        print(f'Data Format for {df_name}')
        display(df.head(2))
        display(df.info())
In [8]:
for df_name, df in df_dict.items():
    CheckDataFormat(df, df_name, is_wide=False)
Data Format for df_minuteIntensitiesNarrow
Id ActivityMinute Intensity
0 1503960366 4/12/2016 12:00:00 AM 0
1 1503960366 4/12/2016 12:01:00 AM 0
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1325580 entries, 0 to 1325579
Data columns (total 3 columns):
 #   Column          Non-Null Count    Dtype 
---  ------          --------------    ----- 
 0   Id              1325580 non-null  int64 
 1   ActivityMinute  1325580 non-null  object
 2   Intensity       1325580 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 30.3+ MB
None
Data Format for df_hourlyIntensities
Id ActivityHour TotalIntensity AverageIntensity
0 1503960366 4/12/2016 12:00:00 AM 20 0.333333
1 1503960366 4/12/2016 1:00:00 AM 8 0.133333
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22099 entries, 0 to 22098
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Id                22099 non-null  int64  
 1   ActivityHour      22099 non-null  object 
 2   TotalIntensity    22099 non-null  int64  
 3   AverageIntensity  22099 non-null  float64
dtypes: float64(1), int64(2), object(1)
memory usage: 690.7+ KB
None
Data Format for df_dailyIntensities
Id ActivityDay SedentaryMinutes LightlyActiveMinutes FairlyActiveMinutes VeryActiveMinutes SedentaryActiveDistance LightActiveDistance ModeratelyActiveDistance VeryActiveDistance
0 1503960366 4/12/2016 728 328 13 25 0.0 6.06 0.55 1.88
1 1503960366 4/13/2016 776 217 19 21 0.0 4.71 0.69 1.57
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 940 entries, 0 to 939
Data columns (total 10 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Id                        940 non-null    int64  
 1   ActivityDay               940 non-null    object 
 2   SedentaryMinutes          940 non-null    int64  
 3   LightlyActiveMinutes      940 non-null    int64  
 4   FairlyActiveMinutes       940 non-null    int64  
 5   VeryActiveMinutes         940 non-null    int64  
 6   SedentaryActiveDistance   940 non-null    float64
 7   LightActiveDistance       940 non-null    float64
 8   ModeratelyActiveDistance  940 non-null    float64
 9   VeryActiveDistance        940 non-null    float64
dtypes: float64(4), int64(5), object(1)
memory usage: 73.6+ KB
None
Data Format for df_minuteCaloriesNarrow
Id ActivityMinute Calories
0 1503960366 4/12/2016 12:00:00 AM 0.7865
1 1503960366 4/12/2016 12:01:00 AM 0.7865
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1325580 entries, 0 to 1325579
Data columns (total 3 columns):
 #   Column          Non-Null Count    Dtype  
---  ------          --------------    -----  
 0   Id              1325580 non-null  int64  
 1   ActivityMinute  1325580 non-null  object 
 2   Calories        1325580 non-null  float64
dtypes: float64(1), int64(1), object(1)
memory usage: 30.3+ MB
None
Data Format for df_hourlyCalories
Id ActivityHour Calories
0 1503960366 4/12/2016 12:00:00 AM 81
1 1503960366 4/12/2016 1:00:00 AM 61
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22099 entries, 0 to 22098
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Id            22099 non-null  int64 
 1   ActivityHour  22099 non-null  object
 2   Calories      22099 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 518.1+ KB
None
Data Format for df_dailyCalories
Id ActivityDay Calories
0 1503960366 4/12/2016 1985
1 1503960366 4/13/2016 1797
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 940 entries, 0 to 939
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Id           940 non-null    int64 
 1   ActivityDay  940 non-null    object
 2   Calories     940 non-null    int64 
dtypes: int64(2), object(1)
memory usage: 22.2+ KB
None
Data Format for df_minuteStepsNarrow
Id ActivityMinute Steps
0 1503960366 4/12/2016 12:00:00 AM 0
1 1503960366 4/12/2016 12:01:00 AM 0
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1325580 entries, 0 to 1325579
Data columns (total 3 columns):
 #   Column          Non-Null Count    Dtype 
---  ------          --------------    ----- 
 0   Id              1325580 non-null  int64 
 1   ActivityMinute  1325580 non-null  object
 2   Steps           1325580 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 30.3+ MB
None
Data Format for df_hourlySteps
Id ActivityHour StepTotal
0 1503960366 4/12/2016 12:00:00 AM 373
1 1503960366 4/12/2016 1:00:00 AM 160
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22099 entries, 0 to 22098
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Id            22099 non-null  int64 
 1   ActivityHour  22099 non-null  object
 2   StepTotal     22099 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 518.1+ KB
None
Data Format for df_dailySteps
Id ActivityDay StepTotal
0 1503960366 4/12/2016 13162
1 1503960366 4/13/2016 10735
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 940 entries, 0 to 939
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Id           940 non-null    int64 
 1   ActivityDay  940 non-null    object
 2   StepTotal    940 non-null    int64 
dtypes: int64(2), object(1)
memory usage: 22.2+ KB
None
Data Format for df_sleepDay
Id SleepDay TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
0 1503960366 4/12/2016 12:00:00 AM 1 327 346
1 1503960366 4/13/2016 12:00:00 AM 2 384 407
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 413 entries, 0 to 412
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Id                  413 non-null    int64 
 1   SleepDay            413 non-null    object
 2   TotalSleepRecords   413 non-null    int64 
 3   TotalMinutesAsleep  413 non-null    int64 
 4   TotalTimeInBed      413 non-null    int64 
dtypes: int64(4), object(1)
memory usage: 16.3+ KB
None
Data Format for df_minuteSleep
Id date value logId
0 1503960366 4/12/2016 2:47:30 AM 3 11380564589
1 1503960366 4/12/2016 2:48:30 AM 2 11380564589
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 188521 entries, 0 to 188520
Data columns (total 4 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   Id      188521 non-null  int64 
 1   date    188521 non-null  object
 2   value   188521 non-null  int64 
 3   logId   188521 non-null  int64 
dtypes: int64(3), object(1)
memory usage: 5.8+ MB
None
Data Format for df_heartrate_seconds
Id Time Value
0 2022484408 4/12/2016 7:21:00 AM 97
1 2022484408 4/12/2016 7:21:05 AM 102
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2483658 entries, 0 to 2483657
Data columns (total 3 columns):
 #   Column  Dtype 
---  ------  ----- 
 0   Id      int64 
 1   Time    object
 2   Value   int64 
dtypes: int64(2), object(1)
memory usage: 56.8+ MB
None
Data Format for df_minuteMETsNarrow
Id ActivityMinute METs
0 1503960366 4/12/2016 12:00:00 AM 10
1 1503960366 4/12/2016 12:01:00 AM 10
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1325580 entries, 0 to 1325579
Data columns (total 3 columns):
 #   Column          Non-Null Count    Dtype 
---  ------          --------------    ----- 
 0   Id              1325580 non-null  int64 
 1   ActivityMinute  1325580 non-null  object
 2   METs            1325580 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 30.3+ MB
None
Data Format for df_weightLogInfo
Id Date WeightKg WeightPounds Fat BMI IsManualReport LogId
0 1503960366 5/2/2016 11:59:59 PM 52.599998 115.963147 22.0 22.65 True 1462233599000
1 1503960366 5/3/2016 11:59:59 PM 52.599998 115.963147 NaN 22.65 True 1462319999000
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 67 entries, 0 to 66
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Id              67 non-null     int64  
 1   Date            67 non-null     object 
 2   WeightKg        67 non-null     float64
 3   WeightPounds    67 non-null     float64
 4   Fat             2 non-null      float64
 5   BMI             67 non-null     float64
 6   IsManualReport  67 non-null     bool   
 7   LogId           67 non-null     int64  
dtypes: bool(1), float64(4), int64(2), object(1)
memory usage: 3.9+ KB
None
Data Format for df_dailyActivity
Id ActivityDate TotalSteps TotalDistance TrackerDistance LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance LightActiveDistance SedentaryActiveDistance VeryActiveMinutes FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
0 1503960366 4/12/2016 13162 8.50 8.50 0.0 1.88 0.55 6.06 0.0 25 13 328 728 1985
1 1503960366 4/13/2016 10735 6.97 6.97 0.0 1.57 0.69 4.71 0.0 21 19 217 776 1797
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 940 entries, 0 to 939
Data columns (total 15 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Id                        940 non-null    int64  
 1   ActivityDate              940 non-null    object 
 2   TotalSteps                940 non-null    int64  
 3   TotalDistance             940 non-null    float64
 4   TrackerDistance           940 non-null    float64
 5   LoggedActivitiesDistance  940 non-null    float64
 6   VeryActiveDistance        940 non-null    float64
 7   ModeratelyActiveDistance  940 non-null    float64
 8   LightActiveDistance       940 non-null    float64
 9   SedentaryActiveDistance   940 non-null    float64
 10  VeryActiveMinutes         940 non-null    int64  
 11  FairlyActiveMinutes       940 non-null    int64  
 12  LightlyActiveMinutes      940 non-null    int64  
 13  SedentaryMinutes          940 non-null    int64  
 14  Calories                  940 non-null    int64  
dtypes: float64(7), int64(7), object(1)
memory usage: 110.3+ KB
None
In [9]:
for df_name, df in df_dict.items():
    CheckDataFormat(df, df_name, is_wide=True)
Data Format for df_minuteIntensitiesWide
Id ActivityHour Intensity00 Intensity01 Intensity02 Intensity03 Intensity04 Intensity05 Intensity06 Intensity07 ... Intensity50 Intensity51 Intensity52 Intensity53 Intensity54 Intensity55 Intensity56 Intensity57 Intensity58 Intensity59
0 1503960366 4/13/2016 12:00:00 AM 1 1 0 0 0 1 0 1 ... 0 1 1 0 1 1 0 0 0 0
1 1503960366 4/13/2016 1:00:00 AM 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

2 rows × 62 columns

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21645 entries, 0 to 21644
Data columns (total 62 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Id            21645 non-null  int64 
 1   ActivityHour  21645 non-null  object
 2   Intensity00   21645 non-null  int64 
 3   Intensity01   21645 non-null  int64 
 4   Intensity02   21645 non-null  int64 
 5   Intensity03   21645 non-null  int64 
 6   Intensity04   21645 non-null  int64 
 7   Intensity05   21645 non-null  int64 
 8   Intensity06   21645 non-null  int64 
 9   Intensity07   21645 non-null  int64 
 10  Intensity08   21645 non-null  int64 
 11  Intensity09   21645 non-null  int64 
 12  Intensity10   21645 non-null  int64 
 13  Intensity11   21645 non-null  int64 
 14  Intensity12   21645 non-null  int64 
 15  Intensity13   21645 non-null  int64 
 16  Intensity14   21645 non-null  int64 
 17  Intensity15   21645 non-null  int64 
 18  Intensity16   21645 non-null  int64 
 19  Intensity17   21645 non-null  int64 
 20  Intensity18   21645 non-null  int64 
 21  Intensity19   21645 non-null  int64 
 22  Intensity20   21645 non-null  int64 
 23  Intensity21   21645 non-null  int64 
 24  Intensity22   21645 non-null  int64 
 25  Intensity23   21645 non-null  int64 
 26  Intensity24   21645 non-null  int64 
 27  Intensity25   21645 non-null  int64 
 28  Intensity26   21645 non-null  int64 
 29  Intensity27   21645 non-null  int64 
 30  Intensity28   21645 non-null  int64 
 31  Intensity29   21645 non-null  int64 
 32  Intensity30   21645 non-null  int64 
 33  Intensity31   21645 non-null  int64 
 34  Intensity32   21645 non-null  int64 
 35  Intensity33   21645 non-null  int64 
 36  Intensity34   21645 non-null  int64 
 37  Intensity35   21645 non-null  int64 
 38  Intensity36   21645 non-null  int64 
 39  Intensity37   21645 non-null  int64 
 40  Intensity38   21645 non-null  int64 
 41  Intensity39   21645 non-null  int64 
 42  Intensity40   21645 non-null  int64 
 43  Intensity41   21645 non-null  int64 
 44  Intensity42   21645 non-null  int64 
 45  Intensity43   21645 non-null  int64 
 46  Intensity44   21645 non-null  int64 
 47  Intensity45   21645 non-null  int64 
 48  Intensity46   21645 non-null  int64 
 49  Intensity47   21645 non-null  int64 
 50  Intensity48   21645 non-null  int64 
 51  Intensity49   21645 non-null  int64 
 52  Intensity50   21645 non-null  int64 
 53  Intensity51   21645 non-null  int64 
 54  Intensity52   21645 non-null  int64 
 55  Intensity53   21645 non-null  int64 
 56  Intensity54   21645 non-null  int64 
 57  Intensity55   21645 non-null  int64 
 58  Intensity56   21645 non-null  int64 
 59  Intensity57   21645 non-null  int64 
 60  Intensity58   21645 non-null  int64 
 61  Intensity59   21645 non-null  int64 
dtypes: int64(61), object(1)
memory usage: 10.2+ MB
None
Data Format for df_minuteCaloriesWide
Id ActivityHour Calories00 Calories01 Calories02 Calories03 Calories04 Calories05 Calories06 Calories07 ... Calories50 Calories51 Calories52 Calories53 Calories54 Calories55 Calories56 Calories57 Calories58 Calories59
0 1503960366 4/13/2016 12:00:00 AM 1.8876 2.2022 0.9438 0.9438 0.9438 2.0449 0.9438 2.2022 ... 0.9438 2.0449 2.0449 0.9438 2.3595 1.8876 0.9438 0.9438 0.9438 0.9438
1 1503960366 4/13/2016 1:00:00 AM 0.7865 0.7865 0.7865 0.7865 0.9438 0.9438 0.9438 0.7865 ... 0.7865 0.7865 0.7865 0.7865 0.7865 0.7865 0.7865 0.7865 0.7865 0.7865

2 rows × 62 columns

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21645 entries, 0 to 21644
Data columns (total 62 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Id            21645 non-null  int64  
 1   ActivityHour  21645 non-null  object 
 2   Calories00    21645 non-null  float64
 3   Calories01    21645 non-null  float64
 4   Calories02    21645 non-null  float64
 5   Calories03    21645 non-null  float64
 6   Calories04    21645 non-null  float64
 7   Calories05    21645 non-null  float64
 8   Calories06    21645 non-null  float64
 9   Calories07    21645 non-null  float64
 10  Calories08    21645 non-null  float64
 11  Calories09    21645 non-null  float64
 12  Calories10    21645 non-null  float64
 13  Calories11    21645 non-null  float64
 14  Calories12    21645 non-null  float64
 15  Calories13    21645 non-null  float64
 16  Calories14    21645 non-null  float64
 17  Calories15    21645 non-null  float64
 18  Calories16    21645 non-null  float64
 19  Calories17    21645 non-null  float64
 20  Calories18    21645 non-null  float64
 21  Calories19    21645 non-null  float64
 22  Calories20    21645 non-null  float64
 23  Calories21    21645 non-null  float64
 24  Calories22    21645 non-null  float64
 25  Calories23    21645 non-null  float64
 26  Calories24    21645 non-null  float64
 27  Calories25    21645 non-null  float64
 28  Calories26    21645 non-null  float64
 29  Calories27    21645 non-null  float64
 30  Calories28    21645 non-null  float64
 31  Calories29    21645 non-null  float64
 32  Calories30    21645 non-null  float64
 33  Calories31    21645 non-null  float64
 34  Calories32    21645 non-null  float64
 35  Calories33    21645 non-null  float64
 36  Calories34    21645 non-null  float64
 37  Calories35    21645 non-null  float64
 38  Calories36    21645 non-null  float64
 39  Calories37    21645 non-null  float64
 40  Calories38    21645 non-null  float64
 41  Calories39    21645 non-null  float64
 42  Calories40    21645 non-null  float64
 43  Calories41    21645 non-null  float64
 44  Calories42    21645 non-null  float64
 45  Calories43    21645 non-null  float64
 46  Calories44    21645 non-null  float64
 47  Calories45    21645 non-null  float64
 48  Calories46    21645 non-null  float64
 49  Calories47    21645 non-null  float64
 50  Calories48    21645 non-null  float64
 51  Calories49    21645 non-null  float64
 52  Calories50    21645 non-null  float64
 53  Calories51    21645 non-null  float64
 54  Calories52    21645 non-null  float64
 55  Calories53    21645 non-null  float64
 56  Calories54    21645 non-null  float64
 57  Calories55    21645 non-null  float64
 58  Calories56    21645 non-null  float64
 59  Calories57    21645 non-null  float64
 60  Calories58    21645 non-null  float64
 61  Calories59    21645 non-null  float64
dtypes: float64(60), int64(1), object(1)
memory usage: 10.2+ MB
None
Data Format for df_minuteStepsWide
Id ActivityHour Steps00 Steps01 Steps02 Steps03 Steps04 Steps05 Steps06 Steps07 ... Steps50 Steps51 Steps52 Steps53 Steps54 Steps55 Steps56 Steps57 Steps58 Steps59
0 1503960366 4/13/2016 12:00:00 AM 4 16 0 0 0 9 0 17 ... 0 9 8 0 20 1 0 0 0 0
1 1503960366 4/13/2016 1:00:00 AM 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

2 rows × 62 columns

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21645 entries, 0 to 21644
Data columns (total 62 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Id            21645 non-null  int64 
 1   ActivityHour  21645 non-null  object
 2   Steps00       21645 non-null  int64 
 3   Steps01       21645 non-null  int64 
 4   Steps02       21645 non-null  int64 
 5   Steps03       21645 non-null  int64 
 6   Steps04       21645 non-null  int64 
 7   Steps05       21645 non-null  int64 
 8   Steps06       21645 non-null  int64 
 9   Steps07       21645 non-null  int64 
 10  Steps08       21645 non-null  int64 
 11  Steps09       21645 non-null  int64 
 12  Steps10       21645 non-null  int64 
 13  Steps11       21645 non-null  int64 
 14  Steps12       21645 non-null  int64 
 15  Steps13       21645 non-null  int64 
 16  Steps14       21645 non-null  int64 
 17  Steps15       21645 non-null  int64 
 18  Steps16       21645 non-null  int64 
 19  Steps17       21645 non-null  int64 
 20  Steps18       21645 non-null  int64 
 21  Steps19       21645 non-null  int64 
 22  Steps20       21645 non-null  int64 
 23  Steps21       21645 non-null  int64 
 24  Steps22       21645 non-null  int64 
 25  Steps23       21645 non-null  int64 
 26  Steps24       21645 non-null  int64 
 27  Steps25       21645 non-null  int64 
 28  Steps26       21645 non-null  int64 
 29  Steps27       21645 non-null  int64 
 30  Steps28       21645 non-null  int64 
 31  Steps29       21645 non-null  int64 
 32  Steps30       21645 non-null  int64 
 33  Steps31       21645 non-null  int64 
 34  Steps32       21645 non-null  int64 
 35  Steps33       21645 non-null  int64 
 36  Steps34       21645 non-null  int64 
 37  Steps35       21645 non-null  int64 
 38  Steps36       21645 non-null  int64 
 39  Steps37       21645 non-null  int64 
 40  Steps38       21645 non-null  int64 
 41  Steps39       21645 non-null  int64 
 42  Steps40       21645 non-null  int64 
 43  Steps41       21645 non-null  int64 
 44  Steps42       21645 non-null  int64 
 45  Steps43       21645 non-null  int64 
 46  Steps44       21645 non-null  int64 
 47  Steps45       21645 non-null  int64 
 48  Steps46       21645 non-null  int64 
 49  Steps47       21645 non-null  int64 
 50  Steps48       21645 non-null  int64 
 51  Steps49       21645 non-null  int64 
 52  Steps50       21645 non-null  int64 
 53  Steps51       21645 non-null  int64 
 54  Steps52       21645 non-null  int64 
 55  Steps53       21645 non-null  int64 
 56  Steps54       21645 non-null  int64 
 57  Steps55       21645 non-null  int64 
 58  Steps56       21645 non-null  int64 
 59  Steps57       21645 non-null  int64 
 60  Steps58       21645 non-null  int64 
 61  Steps59       21645 non-null  int64 
dtypes: int64(61), object(1)
memory usage: 10.2+ MB
None

As it turns out, most of the tables have no null values; the exception is df_weightLogInfo, where the Fat feature is null for almost every data point. Another thing to point out is that every table has an Id attribute, which I believe is a foreign key to a user table. Sadly, no user table is included in this dataset, but we can still use the Id to join most of the tables together. All of the tables also contain datetime data stored as strings, so we need to convert it to a datetime type so we can manipulate it and visualize it more easily later.
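The Id-based join mentioned above can be sketched with tiny hypothetical frames (the values below are illustrative, not taken from the real tables):

```python
import pandas as pd

# Tiny hypothetical frames illustrating an Id + date join between two tables.
steps = pd.DataFrame({'Id': [1, 1, 2],
                      'Date_': ['4/12/2016', '4/13/2016', '4/12/2016'],
                      'StepTotal': [13162, 10735, 5000]})
sleep = pd.DataFrame({'Id': [1, 2],
                      'Date_': ['4/12/2016', '4/12/2016'],
                      'TotalMinutesAsleep': [327, 400]})

# An inner join keeps only (Id, day) pairs present in both tables.
merged = steps.merge(sleep, on=['Id', 'Date_'], how='inner')
print(len(merged))  # 2
```

With the real tables, the date columns would first need to be normalized to a common type, which is exactly what the conversion step below takes care of.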

Now we're going to transform the datetime strings into a proper datetime format. To do that, we need to know which columns hold datetime data. We could use string matching, but after looking at the dataset, each table has exactly one column with the object dtype, and it is the datetime column! So we can simply convert every object column, which saves a bit of time compared to string manipulation.
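One small optimization worth noting: pd.to_datetime can infer the format on its own, but passing an explicit format string skips per-row inference, which matters on the ~1.3M-row minute tables. A sketch with hypothetical sample strings in the same layout the CSVs use:

```python
import pandas as pd

# Hypothetical sample strings in the '4/12/2016 12:00:00 AM' layout of the CSVs.
s = pd.Series(['4/12/2016 12:00:00 AM', '4/12/2016 1:30:00 PM'])

# An explicit format avoids per-row inference and catches layout surprises early.
parsed = pd.to_datetime(s, format='%m/%d/%Y %I:%M:%S %p')
print(parsed.dt.hour.tolist())  # [0, 13]
```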

In [10]:
def StringtoDate(df):
    object_columns = df.select_dtypes(include=['object']).columns
    for object_column in object_columns:
        df[object_column] = pd.to_datetime(df[object_column])
        # use a trailing "_" because a column called Date already exists
        df['Date_'] = df[object_column].apply(lambda dt: dt.date())
        # Time_ keeps the time of day on a placeholder date (ignore the date);
        # a raw .time() object is awkward to plot, so a datetime is used instead
        df['Time_'] = df[object_column].apply(lambda dt: dt.replace(year=2000, month=1, day=1))
In [11]:
for df_name, df in df_dict.items():
    StringtoDate(df)
    print(f'{df_name} StringtoDate Done!')
df_minuteIntensitiesWide StringtoDate Done!
df_minuteIntensitiesNarrow StringtoDate Done!
df_hourlyIntensities StringtoDate Done!
df_dailyIntensities StringtoDate Done!
df_minuteCaloriesWide StringtoDate Done!
df_minuteCaloriesNarrow StringtoDate Done!
df_hourlyCalories StringtoDate Done!
df_dailyCalories StringtoDate Done!
df_minuteStepsWide StringtoDate Done!
df_minuteStepsNarrow StringtoDate Done!
df_hourlySteps StringtoDate Done!
df_dailySteps StringtoDate Done!
df_sleepDay StringtoDate Done!
df_minuteSleep StringtoDate Done!
df_heartrate_seconds StringtoDate Done!
df_minuteMETsNarrow StringtoDate Done!
df_weightLogInfo StringtoDate Done!
df_dailyActivity StringtoDate Done!

I could also have changed Id to a string instead of a number, but it doesn't really affect anything because we're not going to manipulate it arithmetically.

Because this is a case study, I don't have the luxury of asking anyone questions. But if I could, I would ask:

  • Is it possible to get data that distinguishes gender by Id? Our goal is better marketing for a female audience, but the data doesn't seem to offer any way to tell users apart by gender. Since we lack that information, I will widen the scope and analyze all users, women and men alike.
  • Is it permissible to share this data, and are there any limitations? I need to know this so I understand who may access or view the data; it would prevent me from accidentally sharing it with someone I'm not supposed to share it with.

Process

What to do in this section:

  • Create & transform data
  • Maintain Data Integrity
  • Make sure data is unbiased and credible
  • Test Data
  • Clean Data
  • Verify and report on cleaning results

As we have already checked, the data is free of nulls (aside from the weight log). So let's take a look at the datetime data, because there might be something off or unusual about it.

In [12]:
# import library that'll be used

import plotly.express as px
import seaborn as sns

import math
import numpy as np
In [13]:
def plot_minute_distribution(df, df_name):
    # I'm not going to analyze the wide data format, because it holds the same
    # values as the narrow one, and I'm more comfortable with the narrow format
    if ('minute' in df_name) & ('Wide' not in df_name):
        
        temp = df.copy()
        temp['count'] = 1
        temp = temp[['Time_', 'count']].groupby('Time_').sum()
        temp10min = temp.resample('10min').mean()

        fig = px.bar(
            temp10min,
            x=temp10min.index, 
            y='count'
        )

        fig.update_layout(
            title=f'Minute data distribution for {df_name}',
            xaxis_title='Time (in a day)',
            yaxis_title='Count'
        )

        fig.show()
In [14]:
for df_name, df in df_dict.items():
    plot_minute_distribution(df, df_name)

It turns out that every minute-level table has a record for every minute; the exception is sleep, which only has rows when a sleeping activity was detected.
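The "always recorded" claim above can be sanity-checked with a small synthetic example (hypothetical data, not the real tables): a fully recorded day of minute data should contain exactly 1440 rows per user.

```python
import pandas as pd

# Synthetic sanity check: a fully recorded day of minute data has 1440 rows.
idx = pd.date_range('2016-04-12', periods=1440, freq='min')
df = pd.DataFrame({'Id': 1, 'ActivityMinute': idx, 'Intensity': 0})

per_day = df.groupby(df['ActivityMinute'].dt.date).size()
print(per_day.iloc[0])  # 1440
```

Running the same groupby on the real minute tables and looking for days with fewer than 1440 rows would flag any recording gaps.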

In [15]:
def plot_daily_distribution(df, df_name):
    # skip the minute-level tables here; this plot is for hourly and daily data
    if ('minute' not in df_name):
        
        temp = df.copy()
        temp['count'] = 1
        temp = temp[['Date_', 'count']].groupby('Date_').sum()

        fig = px.bar(
            temp,
            x=temp.index, 
            y='count'
        )

        fig.update_layout(
            title=f'Daily data distribution for {df_name}',
            xaxis_title='Time (in a day)',
            yaxis_title='Count'
        )

        fig.show()
In [16]:
for df_name, df in df_dict.items():
    plot_daily_distribution(df, df_name)

From the plots above, we can see that every table starts and ends in the same month. However, the weight table has only a handful of data points. Let's check how many Ids this data covers and what the weight data looks like!
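The start/end observation above can also be verified numerically instead of visually. A sketch with synthetic data showing a per-Id date-coverage check (first date, last date, span in days):

```python
import pandas as pd

# Synthetic sketch of a per-Id date-coverage check.
df = pd.DataFrame({'Id': [1, 1, 2],
                   'Date_': pd.to_datetime(['2016-04-12', '2016-05-12',
                                            '2016-04-20'])})

coverage = df.groupby('Id')['Date_'].agg(['min', 'max'])
coverage['days'] = (coverage['max'] - coverage['min']).dt.days + 1
print(coverage['days'].tolist())  # [31, 1]
```

Applied to each real table, this would show at a glance which users have full-month coverage and which only logged a few days.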

In [17]:
fig = px.line(df_weightLogInfo, 
              x='Date_', 
              y="WeightKg", 
              title='Weight Data plot',
              color='Id',
)

fig.add_scatter(x=df_weightLogInfo['Date_'], 
                y=df_weightLogInfo['WeightKg'],
                mode='markers', 
                marker_size=5,
                marker_color='red'
)
fig.show()

print(f'number of unique ID : {df_weightLogInfo["Id"].nunique()}')
print(f'df_weightLogInfo data point counts : {len(df_weightLogInfo)}')
print('Weight ID value counts:')
display(df_weightLogInfo['Id'].value_counts())
number of unique ID : 8
df_weightLogInfo data point counts : 67
Weight ID value counts:
6962181067    30
8877689391    24
4558609924     5
2873212765     2
1503960366     2
4319703577     2
5577150313     1
1927972279     1
Name: Id, dtype: int64

Only 8 Ids (out of 33) have weight data, with 67 data points in total, and 54 of those points belong to just 2 Ids. On top of that, the recorded weights show no significant changes, so we are unlikely to learn anything from weight trends. I will therefore not use the weight data: it is too sparse and could lead to biased insights. Since our goal is better marketing and weight is not an option, there are two other features correlated with weight, Calories and METs, which we will cover later in the Analyze step.

Next, I will manually create clusters for the main features of the selected daily table: calories, distance, and steps. I'm not including intensities because they are already clustered, and I'm not creating a heart-rate cluster because, based on several articles I've read, resting and active heart rates vary from person to person.

For Calories, I will divide the data into 9 clusters, starting with "less than 1,000 calories burned" and ending with "more than 4,500 calories burned". In between, I split the range evenly into 500-calorie bins, so we have 1,000-1,500, 1,500-2,000, and so on.

In [18]:
df_minuteCaloriesNarrow
Out[18]:
Id ActivityMinute Calories Date_ Time_
0 1503960366 2016-04-12 00:00:00 0.78650 2016-04-12 2000-01-01 00:00:00
1 1503960366 2016-04-12 00:01:00 0.78650 2016-04-12 2000-01-01 00:01:00
2 1503960366 2016-04-12 00:02:00 0.78650 2016-04-12 2000-01-01 00:02:00
3 1503960366 2016-04-12 00:03:00 0.78650 2016-04-12 2000-01-01 00:03:00
4 1503960366 2016-04-12 00:04:00 0.78650 2016-04-12 2000-01-01 00:04:00
... ... ... ... ... ...
1325575 8877689391 2016-05-12 13:55:00 1.33353 2016-05-12 2000-01-01 13:55:00
1325576 8877689391 2016-05-12 13:56:00 1.33353 2016-05-12 2000-01-01 13:56:00
1325577 8877689391 2016-05-12 13:57:00 1.33353 2016-05-12 2000-01-01 13:57:00
1325578 8877689391 2016-05-12 13:58:00 1.33353 2016-05-12 2000-01-01 13:58:00
1325579 8877689391 2016-05-12 13:59:00 1.33353 2016-05-12 2000-01-01 13:59:00

1325580 rows × 5 columns

In [19]:
import math

calories_clust = ['< 1000'] + [f'{(n-1)*500} - {n*500}' for n in range(2, 10)] + ['> 4500']

def getCaloriesCluster(x):
    if x > 4500:
        return '> 4500'
    if x < 1000:
        return '< 1000'
    # floor(x / 500) indexes into calories_clust, e.g. x = 1200 -> index 2 -> '1000 - 1500'
    return calories_clust[math.floor(x / 500)]

for df_name, df in df_dict.items():
    # only apply to tables that hold daily data and have Calories as a feature
    if ('Calories' in df.columns) and ('minute' not in df_name) and ('hour' not in df_name):
        df['CaloriesCluster'] = df['Calories'].apply(getCaloriesCluster)
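As a side note, the same binning can be expressed declaratively with `pandas.cut`. A minimal sketch (illustrative values, not the notebook's tables) that mirrors `getCaloriesCluster`'s bins and labels:

```python
import pandas as pd

# bin edges and labels mirroring getCaloriesCluster:
# < 1000, then 500-wide bins up to 4500, then > 4500
edges = [-float("inf"), 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500, float("inf")]
labels = ["< 1000"] + [f"{lo} - {lo + 500}" for lo in range(1000, 4500, 500)] + ["> 4500"]

calories = pd.Series([800, 1200, 2499, 4501])
# right=False makes the bins left-inclusive: [1000, 1500), [1500, 2000), ...
clusters = pd.cut(calories, bins=edges, labels=labels, right=False)
print(list(clusters))  # ['< 1000', '1000 - 1500', '2000 - 2500', '> 4500']
```

This avoids a hand-written branching function and keeps the edges and labels in one place.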

For Distance and Steps, I'm going to cluster based on quantile values, so each cluster holds roughly the same number of data points.

In [20]:
def createDistanceCluster(x):
    if x <= 3:
        return "0 - 3 Miles"
    if x <= 6:
        return "3 - 6 Miles"
    if x <= 8:
        return "6 - 8 Miles"
    if x > 8:
        return "> 8 Miles"
    return "no cluster"  # e.g. missing values fail every comparison
    
distance_cluster_order = ["0 - 3 Miles", "3 - 6 Miles", "6 - 8 Miles", "> 8 Miles"]
df_dailyActivity['distanceCluster'] = df_dailyActivity['TotalDistance'].apply(createDistanceCluster)
In [21]:
def createStepCluster(x):
    if x <= 3800:
        return "0 - 3.8k steps"
    if x <= 7500:
        return "3.8 - 7.5k steps"
    if x <= 11000:
        return "7.5k - 11k steps"
    if x > 11000:
        return "> 11k steps"
    return "no cluster"  # e.g. missing values fail every comparison
    
step_cluster_order = ["0 - 3.8k steps", "3.8 - 7.5k steps", "7.5k - 11k steps", "> 11k steps"]
df_dailyActivity['stepCluster'] = df_dailyActivity['TotalSteps'].apply(createStepCluster)
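The fixed thresholds above (3.8k / 7.5k / 11k) closely track the quartiles of TotalSteps. If we wanted pandas to derive equal-frequency bins directly, `pd.qcut` does exactly that; a sketch with made-up step counts and hypothetical label names:

```python
import pandas as pd

steps = pd.Series([0, 2500, 4000, 6000, 8000, 9500, 12000, 20000])

# four equal-frequency bins; qcut picks the quartile edges itself
clusters = pd.qcut(steps, q=4, labels=["low", "mid-low", "mid-high", "high"])
print(clusters.value_counts())  # every bin holds the same number of points
```

With real data the derived edges would land near the 3.8k/7.5k/11k thresholds used above.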

I will also create new features for which day of the week it is and whether it's a weekend, since people usually schedule their workouts around particular days.

In [22]:
df_dailyActivity['ActivityDate'] = pd.to_datetime(df_dailyActivity['ActivityDate'])

df_dailyActivity['dayofweek'] = [d.weekday() for d in df_dailyActivity['ActivityDate']]
df_dailyActivity['isWeekend'] = df_dailyActivity['dayofweek'].apply(lambda x:x>=5)

# dayofweek still represented in 0-6 value, Let's change it
dayofweek_dict = {
    0 : 'Monday',
    1 : 'Tuesday',
    2 : 'Wednesday',
    3 : 'Thursday',
    4 : 'Friday',
    5 : 'Saturday',
    6 : 'Sunday'
}

df_dailyActivity['dayofweek'] = df_dailyActivity['dayofweek'].apply(lambda x:dayofweek_dict[x])
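For reference, pandas can derive the weekday name and weekend flag in one step via the `.dt` accessor; a small sketch with sample dates showing the equivalent result:

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(["2016-04-12", "2016-04-16"]))

dayofweek = dates.dt.day_name()      # 'Monday' ... 'Sunday', no mapping dict needed
is_weekend = dates.dt.weekday >= 5   # Saturday (5) or Sunday (6)

print(list(dayofweek), list(is_weekend))  # ['Tuesday', 'Saturday'] [False, True]
```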

For sleep time, I will divide the data into 3 clusters: 0-7 hours as less-than-optimal sleep, 7-9 hours as enough sleep, and 9+ hours as oversleep. These boundaries are based on my own research: most of the articles I read state that adults (18+) need 7-9 hours of sleep.

In [23]:
activeMinutesCol = ['VeryActiveMinutes', 'FairlyActiveMinutes', 'LightlyActiveMinutes']

df_dailyActivity['TotalActiveMinutes'] = df_dailyActivity[activeMinutesCol].sum(axis=1)
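The sleep cluster described above (0-7 / 7-9 / 9+ hours) is not created in this cell; a minimal sketch of it, assuming the `TotalMinutesAsleep` column from the sleep table and label names of my own choosing:

```python
def getSleepCluster(minutes_asleep):
    # 0-7h: less than optimal, 7-9h: enough, 9h+: oversleep (labels are illustrative)
    hours = minutes_asleep / 60
    if hours < 7:
        return "less than optimal (< 7h)"
    if hours <= 9:
        return "enough (7 - 9h)"
    return "oversleep (> 9h)"

print(getSleepCluster(360), getSleepCluster(480), getSleepCluster(600))
```

Applied with e.g. `df_sleepDay['sleepCluster'] = df_sleepDay['TotalMinutesAsleep'].apply(getSleepCluster)`.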

Prepare & Process Section Summary

Data Pre-processing
  • Convert every datetime feature stored as a string into a pandas datetime object.
  • Create 2 new features for every table, 'Date_' and 'Time_', by splitting the datetime data into date and time. The trailing underscore in the feature names prevents overwriting an existing feature.
  • For more detail on the dates in the dailyActivity table, create 2 more features that tell which day of the week it is and whether it's a weekend.
  • Add TotalActiveMinutes to the dailyActivity table. It accumulates VeryActiveMinutes, FairlyActiveMinutes, and LightlyActiveMinutes. SedentaryMinutes is excluded because, per the data dictionary, it represents little to no exercise, so it won't affect calories much.
Clustering
  • Manually create clusters for the sleep, calories, distance, and steps features.
Data Cleaning
  • No data cleaning was needed, so there is no documentation for it. Had there been any, I would compile the cleaning steps at the end so the next person using the data could reuse the code.
Data Selection
  • Exclude the Wide tables, because the same data exists in Narrow format, which I'm more familiar with and can handle better.
  • Exclude the Weight data, because it is insufficient and might lead to biased analysis.

Analyze

What to do in this section:

  • Format & Transform
  • Sort & Filter
  • Identify Patterns
  • Make predictions or Recommendations
  • Make data-driven decisions

Our focus for this project is better marketing. For a health app, the purpose is to track user behaviour and tell users how healthy they are. To my knowledge, two things have a huge impact on your health: how active you are, and what you consume. With this app we can track one of those factors, how active you are. To determine how active a user is from the data we have, we can look at their calories burned and their metabolic equivalents (METs).

In [24]:
# data visualization library
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objs as go

# statistics library
import numpy as np
from scipy.stats import pearsonr

from datetime import datetime
In [25]:
fig = px.histogram(
    df_dailyActivity, 
    x="Calories", 
    marginal="rug"
)

fig.update_layout(
    title='Calories burned distribution (daily totals)',
    xaxis_title='Calories Burned',
    yaxis_title='Count'
)
fig.show()

Let's also compare this with what a calories-burned distribution should look like:

[Reference images: women's and men's daily calories-burned distributions]

The data we have might include both male and female users, and there's no way to distinguish them from calories data alone. But our distribution looks similar to the reference distributions I found online, which is a good sign. Next, we'll look at how calories burned and METs change throughout a day. To do this, I will take one user as an example and look at the changes within a day (a day has 1440 minutes).

In [26]:
sampleMETs_data = df_minuteMETsNarrow[df_minuteMETsNarrow['Id'] == 1503960366]
sampleCal_data = df_minuteCaloriesNarrow[df_minuteCaloriesNarrow['Id'] == 1503960366]
In [27]:
display(sampleCal_data.head(3))
print(f'number of Calories data point (sample) : {len(sampleCal_data)}')
print(f'calories max value: {sampleCal_data["Calories"].max()}')
display(sampleMETs_data.head(3))
print(f'number of METs data point (sample) : {len(sampleMETs_data)}')
print(f'METs max value : {sampleMETs_data["METs"].max()}')
Id ActivityMinute Calories Date_ Time_
0 1503960366 2016-04-12 00:00:00 0.7865 2016-04-12 2000-01-01 00:00:00
1 1503960366 2016-04-12 00:01:00 0.7865 2016-04-12 2000-01-01 00:01:00
2 1503960366 2016-04-12 00:02:00 0.7865 2016-04-12 2000-01-01 00:02:00
number of Calories data point (sample) : 43020
calories max value: 8.73015022277832
Id ActivityMinute METs Date_ Time_
0 1503960366 2016-04-12 00:00:00 10 2016-04-12 2000-01-01 00:00:00
1 1503960366 2016-04-12 00:01:00 10 2016-04-12 2000-01-01 00:01:00
2 1503960366 2016-04-12 00:02:00 10 2016-04-12 2000-01-01 00:02:00
number of METs data point (sample) : 43020
METs max value : 111

The ActivityMinute feature is exactly the same in both tables, so we can simply take the first 1440 minutes (one day) and compare them. Since METs have a larger range of values than calories, I will normalize them relative to calories so both series span a similar range.

In [28]:
fig = go.Figure()
fig.add_trace(go.Scatter(x=sampleCal_data.head(1440)['ActivityMinute'], y=sampleCal_data.head(1440)['Calories'],
                    mode='lines',
                    name='Calories'))
fig.add_trace(go.Scatter(x=sampleMETs_data.head(1440)['ActivityMinute'], y=(sampleMETs_data.head(1440)['METs']/10) + 9,
                    mode='lines',
                    name='METs (normalized)'))

fig.update_layout(title='Calories and METs (normalized) changes within a day',
                   xaxis_title='Timeline',
                   yaxis_title='Value')

fig.update_xaxes(showticklabels=False)
fig.update_yaxes(showticklabels=False)

fig.show()
In [29]:
corr, _ = pearsonr(df_minuteMETsNarrow['METs'], df_minuteCaloriesNarrow['Calories'])
print(f'Calories and METs Correlation : {corr}')
Calories and METs Correlation : 0.9559606262926739
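One caveat: the `pearsonr` call above pairs the two tables purely by row order. A more defensive sketch (toy data, not the real tables) merges on `Id` and `ActivityMinute` first, so a shuffled table cannot silently skew the correlation:

```python
import numpy as np
import pandas as pd

mets = pd.DataFrame({
    "Id": [1, 1, 2],
    "ActivityMinute": ["00:00", "00:01", "00:00"],
    "METs": [10, 12, 30],
})
cals = pd.DataFrame({  # deliberately in a different row order
    "Id": [2, 1, 1],
    "ActivityMinute": ["00:00", "00:01", "00:00"],
    "Calories": [3.0, 1.2, 1.0],
})

# align rows explicitly before correlating
merged = mets.merge(cals, on=["Id", "ActivityMinute"])
corr = np.corrcoef(merged["METs"], merged["Calories"])[0, 1]
print(corr)  # 1.0 for this perfectly linear toy data
```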

So Calories and METs are almost perfectly positively correlated. Because of that, we can use one feature to represent the other; I will use calories, since it is a more widely known term than METs. Next, I will check whether certain days have an impact on calories burned.

In [30]:
fig = px.box(df_dailyActivity, x="dayofweek", y="Calories")

fig.update_layout(
    title='Calories Burn based on the day',
    xaxis_title='Day',
    yaxis_title='Calories Burn'
)
fig.show()

# ------------------------

fig = px.box(df_dailyActivity, x="isWeekend", y="Calories")

fig.update_layout(
    title='Calories burn during weekdays and weekend',
    xaxis_title='Is it weekend?',
    yaxis_title='Calories Burn'
)

fig.show()

Based on the day of week, it's hard to spot any pattern. For weekdays vs. weekends, people appear somewhat more active on weekends: the lower tail of the weekend plot is higher and the overall box range is smaller, meaning most of the data lies within a tighter band. But after more analysis, I don't think we can state that "people tend to be more active on weekends". Why? The lower tail changes from 403 on weekdays to 1032 on weekends, and based on the calories-burned distributions for men and women shown earlier, fewer than 1% of women burn that little in a day. So the weekday lows may come from miscalculation, or from the app not recording a full 24 hours of the user's calories burned, making the totals lower than they should be.

Next, I will check on distance and steps feature!

In [31]:
df_dailyActivity[['TotalSteps', 'TotalDistance', 'TotalActiveMinutes', 'Calories']].describe().T
Out[31]:
count mean std min 25% 50% 75% max
TotalSteps 940.0 7637.910638 5087.150742 0.0 3789.75 7405.500 10727.0000 36019.000000
TotalDistance 940.0 5.489702 3.924606 0.0 2.62 5.245 7.7125 28.030001
TotalActiveMinutes 940.0 227.542553 121.776307 0.0 146.75 247.000 317.2500 552.000000
Calories 940.0 2303.609574 718.166862 0.0 1828.50 2134.000 2793.2500 4900.000000
In [32]:
fig = px.scatter(
    df_dailyActivity, 
    x="TotalSteps", 
    y="TotalDistance",
    trendline="ols",
)

fig.update_layout(
    title="Total Step vs Total Distance",
    xaxis_title='Total Step',
    yaxis_title='Total Distance (Miles)'
)

fig.data[1].line.color = 'red'
fig.show()

As steps go up, distance goes up as well; we can conclude they are highly positively correlated. But some of the data points sit noticeably off the trend. Let's take a closer look.

In [33]:
df_high_dist_steps = df_dailyActivity.loc[(
    df_dailyActivity['TotalSteps'] > 15000) | (df_dailyActivity['TotalDistance'] > 10)].copy()
In [34]:
df_high_dist_steps["Id"] = df_high_dist_steps["Id"].apply(str)
In [35]:
fig = px.scatter(df_high_dist_steps, 
                 x="TotalSteps", 
                 y="TotalDistance",
                 color="Id",
                 title="Total Distance vs Total Step (Zoomed in on higher value)")
fig.show()

It turns out that all of those "strange" points come from a single user. This might be a system error, or this user may simply have a longer stride, covering more distance with fewer steps than others.

Since there's a gap between steps and distance, my instinct says it's because running covers more distance than walking for the same number of steps. So the relationship between steps and distance can represent the intensity of those steps.
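One way to quantify that intuition is a distance-per-step ratio (a rough stride/intensity proxy). This is a sketch with made-up numbers; `DistancePerStep` is not a column in the dataset:

```python
import pandas as pd

df = pd.DataFrame({
    "TotalSteps": [10000, 10000, 5000],
    "TotalDistance": [4.5, 7.0, 3.5],  # same steps can cover different distances
})

# larger values suggest longer strides (e.g. running rather than walking)
df["DistancePerStep"] = df["TotalDistance"] / df["TotalSteps"]
print(df["DistancePerStep"].tolist())
```

A feature like this would let us separate "many short steps" days from "fewer, longer strides" days directly.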

In [36]:
fig = px.box(df_dailyActivity, x="stepCluster", y="Calories",
             facet_col="distanceCluster", category_orders={
                 "stepCluster": step_cluster_order, 
                 "distanceCluster": distance_cluster_order
             },
             title='Distance and Steps vs Calories'
)

fig.show()

From the visualization above, we can see that within a distance cluster, fewer steps correspond to a higher median calories burned. So the pace/intensity of the walk matters more than the distance, and the differences between clusters are quite large (each cluster spans 2-3 miles).

Next, let's check how steps, distance, and active minutes affect calories burned. I will skip distance, because we already know steps and distance are highly correlated, so the graph would look very similar.

In [37]:
fig = px.scatter(df_dailyActivity, 
                 x="TotalSteps", 
                 y="Calories",
                 trendline="ols",
                 title="Calories vs Total Steps",
                 hover_data=["Id"]
)
fig.data[1].line.color = 'red'
fig.show()

fig = px.scatter(df_dailyActivity, 
                 x="TotalActiveMinutes", 
                 y="Calories",
                 trendline="ols",
                 title="Calories vs Total Active Minutes",
                 hover_data=["Id"]
)
fig.data[1].line.color = 'red'
fig.show()

It turns out that total steps have a stronger effect than total active minutes. Since total active minutes is the sum of 3 levels of active minutes, let's look at how much impact each level has on calories.

In [38]:
df_activeMinutes_summary = df_dailyActivity[
    ['CaloriesCluster',
     'VeryActiveMinutes', 
     'FairlyActiveMinutes', 
     'LightlyActiveMinutes',
     'Calories',
     ]
].groupby("CaloriesCluster").mean().reset_index()

# the cluster order isn't sorted correctly: the '< 1000' cluster
# ends up last when it should come first, which would affect the
# visualization sequence later. Manually move it to the front.
df_activeMinutes_summary["sequence"] = range(1, len(df_activeMinutes_summary) + 1)
df_activeMinutes_summary.loc[df_activeMinutes_summary['CaloriesCluster'] == '< 1000', 'sequence'] = 0
df_activeMinutes_summary = df_activeMinutes_summary.sort_values("sequence").drop('sequence', axis=1).reset_index(drop=True)
In [39]:
df_activeDistance_summary = df_dailyActivity[
    ['CaloriesCluster',
     'VeryActiveDistance', 
     'ModeratelyActiveDistance', 
     'LightActiveDistance',
     'Calories',
     ]
].groupby("CaloriesCluster").mean().reset_index()

# the cluster order isn't sorted correctly: the '< 1000' cluster
# ends up last when it should come first, which would affect the
# visualization sequence later. Manually move it to the front.
df_activeDistance_summary["sequence"] = range(1, len(df_activeDistance_summary) + 1)
df_activeDistance_summary.loc[df_activeDistance_summary['CaloriesCluster'] == '< 1000', 'sequence'] = 0
df_activeDistance_summary = df_activeDistance_summary.sort_values("sequence").drop('sequence', axis=1).reset_index(drop=True)
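An alternative to the manual `sequence` column is to declare the cluster order up front with an ordered `pd.Categorical`, after which groupby results come out pre-sorted. A sketch with a toy frame, assuming labels in `getCaloriesCluster`'s format:

```python
import pandas as pd

order = ["< 1000"] + [f"{lo} - {lo + 500}" for lo in range(1000, 4500, 500)] + ["> 4500"]

df = pd.DataFrame({
    "CaloriesCluster": ["1500 - 2000", "< 1000", "> 4500"],
    "VeryActiveMinutes": [20, 5, 90],
})
df["CaloriesCluster"] = pd.Categorical(df["CaloriesCluster"], categories=order, ordered=True)

# groups now come back in cluster order, '< 1000' first
summary = df.groupby("CaloriesCluster", observed=True).mean().reset_index()
print(summary["CaloriesCluster"].tolist())
```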
In [40]:
def px_lineActivePlot(data, x_axis, col_names, name_seq):
    # go.Line is deprecated, so use go.Scatter with mode='lines' instead
    fig = go.Figure(data=[
        go.Scatter(name=name, 
           x=data[x_axis], 
           y=data[col_name],
           mode='lines+text',
           text=round(data[col_name],1)
        ) for col_name, name in zip(col_names, name_seq)
    ])

    fig.update_layout(
        title=f"{x_axis} vs Average{name_seq[0][name_seq[0].find(' '):]}",
        xaxis_title='Calories Cluster',
        yaxis_title=f"Average{name_seq[0][name_seq[0].find(' '):]}")
    fig.show()
In [41]:
col_names_minutes = ['VeryActiveMinutes', 'FairlyActiveMinutes', 'LightlyActiveMinutes']
name_seq_minutes = ['Very Active Minutes', 'Fairly Active Minutes', 'Lightly Active Minutes']

px_lineActivePlot(df_activeMinutes_summary, 'CaloriesCluster', col_names_minutes, name_seq_minutes)

In [42]:
col_names_dist = ['VeryActiveDistance', 'ModeratelyActiveDistance', 'LightActiveDistance']
name_seq_dist = ['Very Active Distance', 'Moderately Active Distance', 'Light Active Distance']

px_lineActivePlot(df_activeDistance_summary, 'CaloriesCluster', col_names_dist, name_seq_dist)

From the visualization above, we can conclude that very active minutes and very active distance play a huge role in calories burned. This strongly supports our earlier finding that fewer steps over a given distance (higher intensity) lead to more calories burned.

Let's check when users are most active.

In [43]:
hourly_mean = df_hourlyIntensities.groupby('Time_').mean()

fig = px.line(
    hourly_mean, 
    x=hourly_mean.index, 
    y="AverageIntensity", 
    title='Average intensity (Hourly)')

fig.show()

minute_mean = df_minuteIntensitiesNarrow.groupby('Time_').mean()

fig = px.line(
    minute_mean, 
    x=minute_mean.index, 
    y="Intensity", 
    title='Average intensity (Minutes)')

fig.show()

So users are mostly active from 17:00 to 19:00; from this I can recommend when to run the marketing campaign. Next, let's look at the sleep features and see whether sleep affects how active you are during the day.

In [44]:
df_activity_join_sleep = df_sleepDay.merge(df_dailyActivity, left_on=['Id', 'Date_'], right_on=['Id', 'Date_'])

display(df_activity_join_sleep.head(2))
print(f'number of unique Id : {df_activity_join_sleep["Id"].nunique()}')
Id SleepDay TotalSleepRecords TotalMinutesAsleep TotalTimeInBed Date_ Time__x ActivityDate TotalSteps TotalDistance ... LightlyActiveMinutes SedentaryMinutes Calories Time__y CaloriesCluster distanceCluster stepCluster dayofweek isWeekend TotalActiveMinutes
0 1503960366 2016-04-12 1 327 346 2016-04-12 2000-01-01 2016-04-12 13162 8.50 ... 328 728 1985 2000-01-01 1500 - 2000 > 8 Miles > 11k steps Tuesday False 366
1 1503960366 2016-04-13 2 384 407 2016-04-13 2000-01-01 2016-04-13 10735 6.97 ... 217 776 1797 2000-01-01 1500 - 2000 6 - 8 Miles 7.5k - 11k steps Wednesday False 257

2 rows × 28 columns

number of unique Id : 24

let's check on how total minutes asleep will effect total time in bed

In [45]:
fig = px.scatter(df_activity_join_sleep, 
                 x="TotalMinutesAsleep", 
                 y="TotalTimeInBed",
                 trendline="ols",
                 title="Total Minutes Asleep vs Total Time In Bed",
                 hover_data=["Id"]
)
fig.data[1].line.color = 'red'
fig.show()

Time asleep is also counted as time in bed, so total time in bed can never be lower than total minutes asleep. Next, let's check how sleep time affects how active you are during the day.

In [46]:
# make it into hour will also make it easier to understand
df_activity_join_sleep['TotalHoursAsleepRounded'] = round(df_activity_join_sleep['TotalMinutesAsleep']/60)
In [47]:
colors = ['#3a5eff', '#ff3a3a', '#1ccc02'] # for better continuity from the visualization before
temp_df = df_activity_join_sleep.sort_values(by=['TotalHoursAsleepRounded'])

fig = go.Figure(data = [
    go.Box(
        x=temp_df['TotalHoursAsleepRounded'],
        y=temp_df[col_name],
        name=name,
        marker_color=color
    ) for col_name, name, color in zip(col_names_minutes, name_seq_minutes, colors)
])

fig.update_layout(
    title='Active Minutes vs Total Hours Asleep',
    yaxis_title='Active Minutes',
    xaxis_title='Total Hours Asleep',
    boxmode='group' # group together boxes of the different traces for each value of x
)

fig.show()

Lightly active minutes dwarf the measures we already know matter more, especially very active minutes, which affect calories the most. Let's re-plot without them.

In [48]:
colors = ['#3a5eff', '#ff3a3a', '#1ccc02']
temp_df = df_activity_join_sleep.sort_values(by=['TotalHoursAsleepRounded'])

fig = go.Figure(data = [
    go.Box(
        x=temp_df['TotalHoursAsleepRounded'].apply(str),
        y=temp_df[col_name],
        name=name,
        marker_color=color
    ) for col_name, name, color in zip(col_names_minutes[:2], name_seq_minutes[:2], colors[:2])
])

fig.update_layout(
    title='Active Minutes (Lightly Active Minutes Excluded) vs Total Hours Asleep',
    yaxis_title='Active Minutes',
    xaxis_title='Total Hours Asleep',
    boxmode='group' # group together boxes of the different traces for each value of x
)

fig.show()

Since we've excluded sedentary minutes from the analysis above, I'm going to check whether sedentary minutes have any effect on sleep time.

In [49]:
fig = px.box(df_activity_join_sleep, x='TotalHoursAsleepRounded', y='SedentaryMinutes')

fig.update_layout(
    title='Sedentary Minutes vs Total Hours Asleep',
    yaxis_title='Sedentary Minutes',
    xaxis_title='Total Hours Asleep',
    boxmode='group' # group together boxes of the different traces for each value of x
)

fig.show()

Interestingly, users with 1 hour of sleep have more very active minutes. This might be because we only have a small amount of data at the extreme values (very low or very high hours of sleep). As expected, though, very active minutes increase for users with 5-8 total hours of sleep. From this plot we learn that sleeping more than 8 hours or less than 5 makes it less likely that you'll have many very active minutes. As for sedentary activity, users tend to sleep poorly when they do a lot of it (little to no activity). These are interesting findings, and now I'm feeling very excited! Let's dive deeper into the sleep features. I have a hypothesis that the sleeping schedule also plays a role in how active you are. But first, let's try to identify when a user is considered "sleeping" in the minuteSleep data.

From the time distributions back in the Process step, we already have an intuition that df_minuteSleep only holds records while the user is considered asleep, because most of the data lies within typical sleeping hours. I will try to prove it in this section using logId. First, let's find someone with more than 1 total sleep record within a day and take them as a sample.

In [50]:
df_activity_join_sleep.loc[df_activity_join_sleep['TotalSleepRecords'] > 1].head(1)
Out[50]:
Id SleepDay TotalSleepRecords TotalMinutesAsleep TotalTimeInBed Date_ Time__x ActivityDate TotalSteps TotalDistance ... SedentaryMinutes Calories Time__y CaloriesCluster distanceCluster stepCluster dayofweek isWeekend TotalActiveMinutes TotalHoursAsleepRounded
1 1503960366 2016-04-13 2 384 407 2016-04-13 2000-01-01 2016-04-13 10735 6.97 ... 776 1797 2000-01-01 1500 - 2000 6 - 8 Miles 7.5k - 11k steps Wednesday False 257 6.0

1 rows × 29 columns

In [51]:
minuteSleep_data_sample = df_minuteSleep.loc[(df_minuteSleep['Id'] == 1503960366) & 
    (df_minuteSleep['Date_'] == datetime(2016,4,13).date())]

print(f'number of unique logId : {minuteSleep_data_sample["logId"].nunique()}')
number of unique logId : 2

So a day with 2 total sleep records has 2 unique logIds. That means we can group the data by Id and logId, taking the max time as the wake-up time and the min time as the sleep start time.

In [52]:
df_sleepTime = df_minuteSleep.groupby(['Id', 'logId']).agg({
    'Date_' : ['max'],
    'Time_' : ['min', 'max'],
}).reset_index()

# renaming columns
new_col_names = list(df_sleepTime.columns.droplevel(1))
new_col_names[3:5] = ('sleepStart', 'sleepEnd')
df_sleepTime.columns = new_col_names

df_sleepTime.head(2)
Out[52]:
Id logId Date_ sleepStart sleepEnd
0 1503960366 11380564589 2016-04-12 2000-01-01 02:47:30 2000-01-01 08:32:30
1 1503960366 11388770715 2016-04-13 2000-01-01 03:08:30 2000-01-01 08:20:30
In [53]:
filter_column_list = ['Id', 'logId', 'Date_', 'sleepStart', 'sleepEnd', 'Calories']
df_sleepTime_with_calories = df_sleepTime.merge(
    df_dailyActivity, 
    left_on=['Id', 'Date_'], 
    right_on=['Id', 'Date_']
)[filter_column_list]

df_sleepTime_with_calories.head(2)
Out[53]:
Id logId Date_ sleepStart sleepEnd Calories
0 1503960366 11380564589 2016-04-12 2000-01-01 02:47:30 2000-01-01 08:32:30 1985
1 1503960366 11388770715 2016-04-13 2000-01-01 03:08:30 2000-01-01 08:20:30 1797
In [54]:
def getTimeHour(x):
    return x.hour

df_sleepTime_with_calories['sleepStartRounded'] = df_sleepTime_with_calories['sleepStart'].apply(getTimeHour)
df_sleepTime_with_calories['sleepEndRounded'] = df_sleepTime_with_calories['sleepEnd'].apply(getTimeHour)
In [55]:
fig = px.box(df_sleepTime_with_calories, x="sleepStartRounded", y="Calories")

fig.update_layout(
    title='Calories vs Sleep start time',
    yaxis_title='Calories',
    xaxis_title='Sleep start time',
)

fig.show()

fig = px.box(df_sleepTime_with_calories, x="sleepEndRounded", y="Calories")

fig.update_layout(
    title='Calories vs Wake up time',
    yaxis_title='Calories',
    xaxis_title='Wake up time',
)

fig.show()

This is strange. Why? Our data shows that 75% of records have at least 6 hours of sleep, so having many records with sleep starting at hour 0 and many with sleep ending at hour 23 doesn't make sense.

I WAS WRONG! I later noticed that this is not a data error: I failed to consider that a night of sleep usually spans 2 different calendar days, so I should have grouped by the sleep logId instead of the date. I won't delete this mistake, because it taught me to always double-check interesting findings. Imagine a real-life scenario where I didn't notice this and shared the finding with the product team; the ending would not be good.
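The mistake is easy to reproduce with toy timestamps: a session that crosses midnight spans two calendar dates, so grouping minute records by date splits one night in half, while grouping by logId keeps it whole:

```python
import pandas as pd

# one sleep session (one logId) that crosses midnight
log = pd.DataFrame({
    "logId": [42] * 4,
    "date": pd.to_datetime([
        "2016-04-12 23:58:00", "2016-04-12 23:59:00",
        "2016-04-13 00:00:00", "2016-04-13 06:00:00",
    ]),
})

by_day = log.groupby(log["date"].dt.date).size()            # splits the night into 2 + 2 rows
by_log = log.groupby("logId")["date"].agg(["min", "max"])   # keeps the whole session

print(by_day.tolist())  # [2, 2]
print(by_log)
```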

In [56]:
df_sleepTime = df_minuteSleep.groupby(['Id', 'logId']).agg({
    'date' : ['min', 'max'],
}).reset_index()

# renaming columns
new_col_names = list(df_sleepTime.columns.droplevel(1))
new_col_names[2:4] = ('sleepStart', 'sleepEnd')
df_sleepTime.columns = new_col_names
df_sleepTime['Date'] = df_sleepTime['sleepEnd'].apply(lambda dt:dt.date())

display(df_sleepTime.head(2))
display(df_dailyActivity[['Id', 'ActivityDate', 'Calories']].head(2))
Id logId sleepStart sleepEnd Date
0 1503960366 11380564589 2016-04-12 02:47:30 2016-04-12 08:32:30 2016-04-12
1 1503960366 11388770715 2016-04-13 03:08:30 2016-04-13 08:20:30 2016-04-13
Id ActivityDate Calories
0 1503960366 2016-04-12 1985
1 1503960366 2016-04-13 1797
In [57]:
filtered_column_list = ['Id', 'logId', 'Date_', 'sleepStart', 'sleepEnd', 'Calories']
df_sleepTime_with_calories = df_sleepTime.merge(
    df_dailyActivity, 
    left_on=['Id', 'Date'], 
    right_on=['Id', 'Date_']
)

df_sleepTime_with_calories[filtered_column_list]
Out[57]:
Id logId Date_ sleepStart sleepEnd Calories
0 1503960366 11380564589 2016-04-12 2016-04-12 02:47:30 2016-04-12 08:32:30 1985
1 1503960366 11388770715 2016-04-13 2016-04-13 03:08:30 2016-04-13 08:20:30 1797
2 1503960366 11388770716 2016-04-13 2016-04-13 20:10:00 2016-04-13 21:43:00 1797
3 1503960366 11402722600 2016-04-15 2016-04-15 02:59:00 2016-04-15 10:20:00 1745
4 1503960366 11421831252 2016-04-16 2016-04-16 02:11:00 2016-04-16 06:59:00 1863
... ... ... ... ... ... ...
454 8792009665 11521265607 2016-04-30 2016-04-30 02:20:00 2016-04-30 08:19:00 2896
455 8792009665 11528665309 2016-05-01 2016-05-01 01:04:00 2016-05-01 09:50:00 1962
456 8792009665 11536572725 2016-05-02 2016-05-02 02:34:30 2016-05-02 09:36:30 2015
457 8792009665 11545482065 2016-05-03 2016-05-03 02:00:30 2016-05-03 11:04:30 2297
458 8792009665 11552534115 2016-05-04 2016-05-04 02:21:00 2016-05-04 10:03:00 2067

459 rows × 6 columns

In [58]:
def getTimeHour(x):
    return x.hour

df_sleepTime_with_calories['sleepStartRounded'] = df_sleepTime_with_calories['sleepStart'].apply(getTimeHour)
df_sleepTime_with_calories['sleepEndRounded'] = df_sleepTime_with_calories['sleepEnd'].apply(getTimeHour)
In [59]:
fig = px.box(df_sleepTime_with_calories, x="sleepStartRounded", y="Calories")

fig.update_layout(
    title='Calories vs Sleep start time',
    yaxis_title='Calories',
    xaxis_title='Sleep start time',
)

fig.show()

fig = px.box(df_sleepTime_with_calories, x="sleepEndRounded", y="Calories")

fig.update_layout(
    title='Calories vs Wake up time',
    yaxis_title='Calories',
    xaxis_title='Wake up time',
)

fig.show()

From the plots above, we can see that users who wake up earlier in the morning (especially at 4 to 5) tend to burn more calories.

In [60]:
fig = px.histogram(df_sleepTime_with_calories, x="sleepStartRounded", nbins=20)

fig.update_layout(
    title='Sleep start data distribution',
    xaxis_title='Sleep start time',
)

fig.show()

Next, let's look at the last data source we'll analyze: heart rate. Fewer than 50% of the Ids have this data, but since it's an important feature, I will do a bit of analysis and look at samples from the users with the highest and lowest average heart rate.

In [61]:
df_heartrate_seconds['Id'] = df_heartrate_seconds['Id'].apply(str)

df_heartrate_seconds.head(2)
Out[61]:
Id Time Value Date_ Time_
0 2022484408 2016-04-12 07:21:00 97 2016-04-12 2000-01-01 07:21:00
1 2022484408 2016-04-12 07:21:05 102 2016-04-12 2000-01-01 07:21:05
In [62]:
df_heartrate_seconds.groupby(['Id', 'Date_']).mean()
Out[62]:
Value
Id Date_
2022484408 2016-04-12 75.804177
2016-04-13 80.337584
2016-04-14 72.628597
2016-04-15 80.437382
2016-04-16 75.960547
... ... ...
8877689391 2016-05-08 72.550523
2016-05-09 89.615738
2016-05-10 71.544377
2016-05-11 89.149122
2016-05-12 71.087439

334 rows × 1 columns

In [63]:
daily_hr_mean = df_heartrate_seconds.groupby(['Id', 'Date_']).mean()

fig = px.box(
    x=daily_hr_mean.index.get_level_values(0),
    y=daily_hr_mean['Value']
)

fig.update_layout(
    title='Heart rate value range for each user',
    yaxis_title='Heart rate value Range',
    xaxis_title='User ID',
)

fig.show()
In [64]:
heartrate_seconds_sample_high = df_heartrate_seconds.loc[(df_heartrate_seconds['Id'] == '6775888955')] 
heartrate_seconds_sample_low = df_heartrate_seconds.loc[(df_heartrate_seconds['Id'] == '4388161847')] 
In [65]:
heartrate_seconds_sample_high.head(2)
Out[65]:
Id Time Value Date_ Time_
1699287 6775888955 2016-04-13 02:41:00 74 2016-04-13 2000-01-01 02:41:00
1699288 6775888955 2016-04-13 02:41:05 106 2016-04-13 2000-01-01 02:41:05
In [66]:
fig = px.line(
    heartrate_seconds_sample_high, 
    x='Time', 
    y="Value", 
    title='Heart rate data from a user with the highest average daily heart rate'
)

fig.show()

fig = px.line(
    heartrate_seconds_sample_low, 
    x='Time', 
    y="Value", 
    title='Heart rate data from a user with the lowest average daily heart rate'
)

fig.show()

I know that the visualization above is not optimal and hard to read, but even for the user with the highest average heart rate, the records are incomplete. Not only do some Ids have no heart rate data at all, but the data that does exist also has a lot of gaps.
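Rather than eyeballing the gaps from the line chart, the missing stretches can be quantified from the timestamps. A minimal sketch on a toy 5-second series (column names match `df_heartrate_seconds`; the gap threshold assumes the nominal 5-second sampling interval):

```python
import pandas as pd

# Toy 5-second heart rate series with a roughly one-hour gap in the middle
times = list(pd.date_range("2016-04-13 02:41:00", periods=3, freq="5s"))
times += list(pd.date_range("2016-04-13 03:41:00", periods=3, freq="5s"))
hr = pd.DataFrame({"Time": times, "Value": [74, 106, 90, 80, 82, 85]})

# Any step larger than the 5-second sampling interval counts as a gap
deltas = hr["Time"].diff()
gaps = deltas[deltas > pd.Timedelta(seconds=5)]
print(f"{len(gaps)} gap(s), largest: {gaps.max()}")
```

Run per user on the real data, this would give a concrete measure of how incomplete each user's heart rate record is.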

Share & Act

What to do in this section:

  • Understand visualization
  • Create effective visuals
  • Communicate to help others understand
  • Apply insights
  • Solve problems

Recommendation:

For Marketing Team
  • The best time to run the marketing campaign

    • People tend to be most active between 17:00 and 19:00, so running a marketing campaign in that window could be beneficial, on the assumption that more people (who are not Bellabeat users yet) are out exercising at that specific time and looking for an app that can track their daily activities.
    • More active people tend to wake up between 04:00 and 07:00. Running the campaign in this window assumes we will reach more of the active people who might take an interest in the product.
  • Who to target

    • Target people who are interested in running: the app tracks daily activities and provides detailed info on walking/running activity.
    • People who are looking to lose weight can also be a target of the marketing campaign.
    • Gym members will also take an interest in the product, since they can track the progress of their workout routine.
    • The data the app collects can help someone with a sleeping problem: we can give them recommendations on how to improve their sleep quality.
  • Educational content topics

    • Topics related to running
      • Walk at least 5k steps a day (roughly what the average adult does daily) for better health, or to burn at least 1.5k calories.
      • If you're looking to lose weight, the intensity of the walk/run matters more than the distance.
    • Topics related to sleeping
      • Waking up early leads to a more active day.
      • More sedentary activity leads to worse sleep, at least in terms of duration.
      • Aim for 6-8 hours of sleep to have a more active day.
    • Make content that promotes what the app can do. Since one of our targets is active people who already work out and want to track their results, promoting what the app can track will also catch their attention.
For Product Team
  • Users are not only burning calories through walking/running: some users with almost no total active minutes still burn more than 1.5k calories. If possible, the product should detect other activities such as swimming, muscle-building workouts, dancing, yoga, etc.
  • Improve the data collection method for heart rate data, either by improving the heart rate monitoring system or by encouraging users to monitor it. Heart rate is one way to check how healthy a user is: with this data we can tell whether a user needs to see a doctor just by looking at their resting heart rate, so improving collection for this feature might even save lives. We could notify users through their phone whenever we stop receiving their heart rate data; perhaps the product is not fitted well, so the hardware can't detect a heartbeat.
  • Weight info has the same problem as heart rate: we need to improve the data collection method for weight data, because it lets us track a user's weight-loss journey. Bellabeat's main target audience is women, who tend to be more concerned about their weight than men, so if we can collect more weight data we can market the product even better! One way to do this is to notify users when the system hasn't received any weight data for the day.
For Data Team
  • Consider adopting a naming convention for both table and column names. There's a table called sleepDay whose date column is named "SleepDay" and stores datetimes in the format "4/12/2016 12:00:00 AM", while the other daily tables all share the column name ActivityDay with the date format "4/12/2016". For table names, only one is different: heartrate_seconds; all the other tables use camel case.
  • Consider choosing representative column names. I noticed a column called "Value" in both the heartrate_seconds and minuteSleep tables. For heartrate_seconds, the value clearly represents the user's heart rate at that moment, but for minuteSleep the value ranges from 1 to 3 and I had to do further research to find out what it means. I found that 1 means the user is asleep, but what about 2 and 3? My suggestion is to rename the column from Value to HeartRate in the heartrate_seconds table, and to something equally descriptive for whatever the minuteSleep value represents.
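Until the source tables are fixed, these inconsistencies could also be smoothed over on the analysis side. A hedged sketch of a small normalization helper; the rename map, the `HeartRate` name, and the `normalize` function are illustrative, not an agreed convention or a full inventory of the dataset's columns:

```python
import pandas as pd

# Illustrative rename map covering only the inconsistencies noted above
RENAMES = {"SleepDay": "ActivityDay", "Value": "HeartRate"}

def normalize(df):
    """Rename known inconsistent columns and parse the date column."""
    df = df.rename(columns={k: v for k, v in RENAMES.items() if k in df.columns})
    if "ActivityDay" in df.columns:
        # pd.to_datetime handles both "4/12/2016 12:00:00 AM" and "4/12/2016"
        df["ActivityDay"] = pd.to_datetime(df["ActivityDay"]).dt.date
    return df

sleep = pd.DataFrame({"SleepDay": ["4/12/2016 12:00:00 AM"],
                      "TotalMinutesAsleep": [327]})
print(normalize(sleep).columns.tolist())  # -> ['ActivityDay', 'TotalMinutesAsleep']
```

Note the map would need care in practice, since "Value" means different things in different tables.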

Action

For both the marketing and product teams, I'll schedule a separate meeting with each team to share my findings and the actions each team needs to take. For the data team, I'd also like to set up a meeting about the two points I've just written, put together a project plan for how and when to take those actions, and be a part of it to make sure each project's goal is met.